5.3 Q-BERT: Hessian-Based Ultra Low-Precision Quantization of BERT
Shen et al. [209] propose Q-BERT, a low-precision uniform quantization method that
utilizes second-order Hessian information. In particular, a Hessian-based mixed-precision
method and a new group-wise quantization scheme are introduced.
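For context, the sketch below illustrates generic symmetric uniform quantization of a weight tensor, in which real-valued weights are mapped to a small set of evenly spaced levels determined by the bit width. The function name and the symmetric, per-tensor design are illustrative assumptions and not necessarily Q-BERT's exact quantizer.

```python
import torch

def uniform_quantize(w: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Symmetric uniform quantization sketch: map weights to evenly spaced
    levels, then de-quantize ("fake quantization"). Illustrative only."""
    qmax = 2 ** (num_bits - 1) - 1      # e.g., 127 for 8 bits, 7 for 4 bits
    scale = w.abs().max() / qmax        # one range per tensor (per group in Q-BERT)
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                    # quantized values mapped back to floats
```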
5.3.1 Hessian-Based Mixed-Precision
Because different encoder layers attend to different structures and exhibit different sensitivity
to quantization [45], the authors argue that assigning the same number of bits to all
layers is sub-optimal. Thus, they explore mixed-precision quantization, where more bits
are assigned to more sensitive layers to retain performance. A previous method, Hessian
AWare Quantization (HAWQ) [59], determines mixed-bit assignments for each layer. Its
main idea is that the parameters in layers with a higher Hessian spectrum (i.e., larger top
eigenvalues) are more sensitive to quantization and require higher precision than layers
with a small Hessian spectrum. However, each encoder layer in a transformer-based model
has a large number of parameters, e.g., 7M. Since the Hessian of such a layer is a matrix
of size 7M × 7M, directly computing the second-order statistics is infeasible. Instead, the
authors adopt a matrix-free power iteration method [270] to compute the Hessian spectrum,
which does not require the explicit formation of the operator. The matrix-free power
iteration method provides the top eigenvalues, which are then used as the indicator of the
sensitivity of a layer. The previous method [59] uses the top eigenvalues averaged over
different training data as the indicator: more aggressive quantization is performed for
layers with smaller top eigenvalues, which correspond to a flatter loss landscape.
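A minimal sketch of such a matrix-free power iteration is shown below, using Hessian-vector products obtained from automatic differentiation so that the Hessian is never materialized. The function name, the fixed iteration count, and the use of PyTorch autograd are assumptions made for illustration; they are not taken from the paper.

```python
import torch

def top_hessian_eigenvalue(loss, params, num_iters: int = 20) -> float:
    """Estimate the top Hessian eigenvalue of `loss` w.r.t. `params` via
    matrix-free power iteration (illustrative sketch, not the paper's code)."""
    # Gradients with create_graph=True so they can be differentiated again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit starting vector, stored block-wise per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(num_iters):
        # Hessian-vector product: Hv = d(g . v)/dw, no explicit Hessian formed.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v gives the current eigenvalue estimate.
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```

Running such a routine per encoder layer on different mini-batches would yield the per-layer eigenvalue samples discussed next.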
However, the authors find that assigning bits based only on the average top eigenvalues
is infeasible for many NLP tasks, because the top Hessian eigenvalues of some layers
exhibit very high variance across different portions of the input dataset. To address this,
the following metric is adopted instead of the mean value alone:
\[
\Omega_i = \left|\operatorname{mean}(\lambda_i)\right| + \operatorname{std}(\lambda_i),
\tag{5.9}
\]
where λi is the distribution of the top eigenvalues of the Hessian of layer i, calculated using
10% of the training dataset. After Ωi is computed for each layer, the values are sorted in
descending order and used as a metric to determine the relative quantization precision, so
that layers with larger Ωi retain more bits. Then, quantization-aware finetuning is performed
based on the selected precision setting. The eigenvalue distributions for various datasets
are shown in Fig. 5.5.
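The following sketch shows how Eq. (5.9) could be evaluated and used to rank layers by sensitivity. The input format and the descending sort follow the description above, while the function name and any bit-assignment policy built on top of the ranking are assumptions for illustration.

```python
import numpy as np

def sensitivity_ranking(top_eigs_per_layer):
    """Compute Omega_i = |mean(lambda_i)| + std(lambda_i) for each layer and
    return the layer indices sorted by decreasing Omega (most sensitive first).
    `top_eigs_per_layer[i]` holds top-eigenvalue estimates of layer i computed
    on different 10% portions of the training data (illustrative sketch)."""
    omegas = [abs(np.mean(eigs)) + np.std(eigs) for eigs in top_eigs_per_layer]
    order = sorted(range(len(omegas)), key=lambda i: omegas[i], reverse=True)
    return omegas, order

# Hypothetical usage: layers appearing earlier in `order` would be assigned
# more bits before quantization-aware finetuning.
# omegas, order = sensitivity_ranking([[61.0, 70.0], [1.5, 1.8], [20.0, 45.0]])
```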
5.3.2 Group-Wise Quantization
For BERT-base, the dimension of each input token is 768 and each self-attention head has 4
dense matrices. Directly quantizing the 4 matrices in multi-head self-attention as an entirety
with a single quantization range can significantly degrade the accuracy, since there are
more than 2M parameters in total and the weights corresponding to each neuron may lie
in a different range of full-precision values. In CNNs, channel-wise quantization can alleviate
this problem because each convolutional kernel can be treated as a single output channel
with its own quantization range. However, because each dense matrix in a transformer-based
model corresponds to a single kernel, channel-wise quantization cannot
be directly applied. Therefore, the authors propose group-wise quantization for attention-
be directly applied. Therefore, the authors propose group-wise quantization for attention-
based models. In particular, each individual matrix W with respect to each head in one